Learning visually grounded words and syntax for a scene description task

Author

Deb K. Roy

Abstract

A spoken language generation system has been developed that learns to describe objects in computer-generated visual scenes. The system is trained by a ‘show-and-tell’ procedure in which visual scenes are paired with natural language descriptions. Learning algorithms acquire probabilistic structures which encode the visual semantics of phrase structure, word classes, and individual words. Using these structures, a planning algorithm integrates syntactic, semantic, and contextual constraints to generate natural and unambiguous descriptions of objects in novel scenes. The system generates syntactically well-formed compound adjective noun phrases, as well as relative spatial clauses. The acquired linguistic structures generalize from training data, enabling the production of novel word sequences which were never observed during training. The output of the generation system is synthesized using word-based concatenative synthesis drawing from the original training speech corpus. In evaluations of semantic comprehension by human judges, the performance of automatically generated spoken descriptions was comparable to human-generated descriptions. This work is motivated by our long-term goal of developing spoken language processing systems which ground semantics in machine perception and action. © 2002 Elsevier Science Ltd. All rights reserved.
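The idea of choosing words that are both visually apt and unambiguous in context can be illustrated with a toy sketch. This is not the paper's algorithm: the abstract does not specify its models, so the Gaussian word models, the feature names (`size`, `brightness`), the numeric parameters, and the function names `log_likelihood` and `best_word` below are all assumptions made up for illustration.

```python
import math

# Hypothetical word models (NOT from the paper): each word is a 1-D Gaussian
# (feature name, mean, std) over a single visual feature scaled to [0, 1].
WORD_MODELS = {
    "large": ("size", 0.8, 0.15),
    "small": ("size", 0.2, 0.15),
    "dark": ("brightness", 0.2, 0.15),
    "light": ("brightness", 0.8, 0.15),
}


def log_likelihood(word, obj):
    """Log-density of the object's relevant feature under the word's Gaussian."""
    feature, mean, std = WORD_MODELS[word]
    x = obj[feature]
    return -0.5 * ((x - mean) / std) ** 2 - math.log(std * math.sqrt(2 * math.pi))


def best_word(target, distractors):
    """Pick the word that fits the target much better than any distractor.

    The score is the target's log-likelihood minus the log-likelihood of the
    best-fitting distractor, so a word that also describes another object in
    the scene is penalized as ambiguous. Assumes at least one distractor.
    """
    def score(word):
        fit = log_likelihood(word, target)
        confusion = max(log_likelihood(word, d) for d in distractors)
        return fit - confusion

    return max(WORD_MODELS, key=score)


# A big object among small ones is best singled out by its size.
scene_target = {"size": 0.85, "brightness": 0.5}
scene_others = [
    {"size": 0.25, "brightness": 0.5},
    {"size": 0.30, "brightness": 0.45},
]
print(best_word(scene_target, scene_others))
```

In this toy scene the size words separate the target from the other objects while the brightness words do not, which is the kind of contextual constraint the abstract refers to. A fuller system would also need the syntactic and phrase-level constraints that this sketch leaves out.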

Similar resources

Learning language through pictures

We propose IMAGINET, a model of learning visually grounded representations of language from coupled textual and visual input. The model consists of two Gated Recurrent Unit networks with shared word embeddings, and uses a multi-task objective by receiving a textual description of a scene and trying to concurrently predict its visual representation and the next word in the sentence. Mimicking an...

Learning Visually Grounded Words and Syntax of Natural Spoken Language

Properties of the physical world have shaped human evolutionary design and given rise to physically grounded mental representations. These grounded representations provide the foundation for higher level cognitive processes including language. Most natural language processing machines to date lack grounding. This paper advocates the creation of physically grounded language learning machines as ...

The BURCHAK corpus: a Challenge Data Set for Interactive Learning of Visually Grounded Word Meanings

We motivate and describe a new freely available human-human dialogue data set for interactive learning of visually grounded word meanings through ostensive definition by a tutor to a learner. The data has been collected using a novel, character-by-character variant of the DiET chat tool (Healey et al., 2003; Mills and Healey, submitted) with a novel task, where a Learner needs to learn invented...

From phonemes to images: levels of representation in a recurrent neural model of visually-grounded language learning

We present a model of visually-grounded language learning based on stacked gated recurrent neural networks which learns to predict visual features given an image description in the form of a sequence of phonemes. The learning task resembles that faced by human language learners who need to discover both structure and meaning from noisy and ambiguous data across modalities. We show that our mode...

Incremental Generation of Visually Grounded Language in Situated Dialogue (demonstration system)

We present a multi-modal dialogue system for interactive learning of perceptually grounded word meanings from a human tutor (Yu et al., ). The system integrates an incremental, semantic, and bidirectional grammar framework – Dynamic Syntax and Type Theory with Records (DS-TTR1, (Eshghi et al., 2012; Kempson et al., 2001)) – with a set of visual classifiers that are learned throughout the intera...

Journal:

Computer Speech & Language

Volume 16, Issue

Pages -

Publication date: 2002
